Comparison study of structured commits

February 2, 2024

Authors

We are four students in M2 or in the final year at Polytech’ Nice-Sophia, specializing in Software Architecture:

I. Research context

The main question we chose to address is: how, and to what extent, are structured commits used in open-source projects? At first, we wanted to understand the impact of commit standards on the management of software projects (clarity, traceability, maintenance), but we soon realized that this is a very broad subject and that most of our questions could not be answered with the tools we had. How can we measure how useful a structured commit was for a project, or how much time was saved thanks to a clear commit message? We therefore revised our questions, and these are the sub-questions we kept:

II. General question

Prior to embarking on this project, none of us were familiar with conventional commits, sparking our curiosity about its prevalence and significance in the open-source community. As we delved into the investigation, we quickly observed that a majority of prominent open-source projects adopted conventional commits as a standard practice. This led us to ponder the reasons behind its widespread adoption. Conventional commits emerged as a straightforward and transparent means of communicating code changes and fixes. The simplicity and clarity of this convention became evident, allowing developers to articulate alterations and provide context with concise messages. By encapsulating both the “what” and “why” of code modifications in a brief message, conventional commits offer a qualitative and beneficial communication channel for developers collaborating on the same codebase. Interestingly, despite our initial lack of exposure to conventional commits, we recognized that structured commit messages were not entirely foreign to us, particularly in our experiences working within companies. This realization prompted us to question the necessity of adhering strictly to the conventional commit standard. This raises an intriguing inquiry: Is the stringent adherence to conventional commits truly imperative, considering that structured commit messages are already employed in professional settings? Our exploration into this aspect seeks to uncover the nuanced perspectives and practical implications of adopting and adhering to the conventional commit standard in the broader context of software development.

III. Information gathering

We gathered information from several sources. We started with the documentation of the commit conventions we did not know, Conventional Commits and gitmoji, so that we could recognize them in projects. We also read the suggested article, “What makes a good commit message?”, which defines a good commit message as one that explains “what” changes were made and “why”. Conventional Commits clearly helps to express these two elements in a commit message, which makes it a good starting point for the question “why is the conventional commit so popular?”.

We collected data from open-source GitHub repositories and analyzed them with PyDriller, a Python module, to build an algorithm that determines the percentage of structured commits in a project. We used Jupyter notebooks to share code and to graph the commit patterns of different projects. We also studied some smaller projects by hand to check the accuracy of the tool.

To choose the data to analyze, we picked large, relatively well-known open-source projects (filtered by number of stars, watchers, etc.), projects that could be analyzed easily, and well-known companies that maintain multiple open-source projects. We decided not to study projects such as Linux or Rust: their huge number of commits was too demanding for our tool, which is meant to be lightweight, and would have made it hard to test and debug the tool on them. For the initial phase, we did not study projects with unclear commit structures, such as React, although we decided to study them for a later question.
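To make the notion of "percentage of structured commits" concrete, here is a minimal sketch of how such a measure could be computed. It is not the tool used in this study: the list of accepted types, the regex, and the helper names (`structured_share`, `messages_from_repo`) are assumptions made for illustration. Only the PyDriller loader touches a real repository, and it is imported lazily so the rest runs with the standard library alone.

```python
import re
from typing import Iterable, Iterator

# Assumption: the commonly used Angular-style type list. The Conventional
# Commits specification itself allows other types as well.
TYPES = "feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert"

# Subject shape: type, optional (scope), optional "!", then ": " and a description.
CONVENTIONAL_HEADER = re.compile(rf"^(?:{TYPES})(?:\([^()\r\n]+\))?!?: \S")


def first_line(message: str) -> str:
    """Return the subject line of a commit message ('' if the message is empty)."""
    lines = message.strip().splitlines()
    return lines[0] if lines else ""


def structured_share(messages: Iterable[str],
                     pattern: re.Pattern = CONVENTIONAL_HEADER) -> float:
    """Percentage (0-100) of messages whose subject line matches `pattern`."""
    subjects = [first_line(m) for m in messages]
    if not subjects:
        return 0.0
    matching = sum(1 for s in subjects if pattern.match(s))
    return 100.0 * matching / len(subjects)


def messages_from_repo(path_or_url: str) -> Iterator[str]:
    """Yield commit messages via PyDriller (requires `pip install pydriller`)."""
    from pydriller import Repository  # lazy import: only needed for real repos
    for commit in Repository(path_or_url).traverse_commits():
        yield commit.msg


# Hand-written sample messages, invented for this example.
sample = [
    "feat(core): add lazy loading",
    "fix: handle empty input",
    "Update README",
    "docs!: drop legacy guide\n\nBREAKING CHANGE: removed v1 pages",
]
print(f"{structured_share(sample):.1f}% structured")  # prints "75.0% structured"
```

On a real project, `structured_share(messages_from_repo("path/to/clone"))` would give the corresponding figure; very large histories would take correspondingly long to traverse.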

IV. Hypotheses & Experiments

Here are our hypotheses on the influence of structured commits on the maintenance and understanding of projects:

In our experimentation, we conducted both quantitative and qualitative analyses of commit practices across diverse projects. While identifying large projects proved relatively straightforward, the challenge lay in locating projects with specific commit structures, particularly those exclusively using Gitmojis. Meta projects posed a distinct difficulty, given the non-public and ambiguous nature of their conventions. Initially opting to avoid them, we later reconsidered and embarked on a manual examination of their commits. This approach, relying on our own interpretations, undoubtedly introduced a lower level of accuracy. In the context of Hypothesis 4, our initial plan involved testing all projects of a single company with a unified commit structure to assess the consistency across projects. However, this approach was deemed impractical for Apache, as they explicitly detailed unique structures for each project. Consequently, we opted to examine the extent to which Apache projects adhere to their prescribed conventions.

V. Result Analysis and Conclusion

1. Commit Analysis Overview

Figure 2: Angular Conventional Commits

Angular demonstrates a strong adherence to conventional commits, which is expected given that Angular has been instrumental in establishing these standards.

Figure 3: Node Conventional Commits

It’s evident that not all repositories align with conventional commit guidelines. As previously mentioned, many opt for custom conventions. This necessitated adaptations in our analysis tool to accommodate the unique conventions of Node.js.

Figure 4: Node Updated Conventions

A significant number of repositories prefer a “verb-start” approach for commit messages, initiating each entry with an action verb. This approach posed analytical challenges, which were overcome by employing Natural Language Processing techniques. Django serves as a notable case, eschewing conventional commits in favor of exclusively using “verb-start” commits:

Figure 13: Django Conventional Commits

Figure 14: Django Updated Commits
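The "verb-start" check can be illustrated with a deliberately simplified sketch. The tool in this study relied on Natural Language Processing; the code below does not reproduce it and instead swaps in a tiny hand-written verb list with crude suffix stripping, as a stand-in for a real part-of-speech tagger. The lexicon, the function name `leading_verb`, and the sample messages (including the ticket number) are all hypothetical.

```python
import re
from typing import Optional, Tuple

# Hypothetical mini-lexicon standing in for a real part-of-speech tagger.
BASE_VERBS = {"add", "fix", "remove", "update", "refactor",
              "rename", "improve", "use", "make", "allow"}

# (suffix, label) pairs tried in order; "" means the word is already a base form.
SUFFIXES = (("", "imperative"), ("ed", "past"), ("d", "past"),
            ("es", "present"), ("s", "present"))


def leading_verb(subject: str) -> Optional[Tuple[str, str]]:
    """Return (base verb, tense label) if the subject starts with a known verb."""
    match = re.match(r"[A-Za-z]+", subject.strip())
    if not match:
        return None
    word = match.group(0).lower()
    for suffix, tense in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and stem in BASE_VERBS:
            return stem, tense
    return None  # unknown or irregular forms (e.g. "made") are not recognized


print(leading_verb("Fixed #123 -- Corrected a typo"))  # ('fix', 'past')
print(leading_verb("Add retry logic"))                 # ('add', 'imperative')
print(leading_verb("README typo"))                     # None
```

A real tagger would cover far more verbs and irregular forms; the point here is only that the first token decides whether the message counts as verb-start, and that its form can indicate which tense a project prefers.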

Regarding the use of gitmojis, few open-source projects consistently incorporate them. The notable exceptions were Gitmoji itself and FastAPI.

Figure 5: Gitmoji Conventional Commits

Figure 6: FastAPI Conventional Commits
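As an illustration of how gitmoji-style subjects might be recognized, the sketch below accepts either a `:shortcode:` or a Unicode emoji at the start of the subject line. The code point ranges are an assumption and are not exhaustive, and the function name is hypothetical; this is not the detection logic of our tool.

```python
import re

# A gitmoji can be written as a shortcode such as ":sparkles:" ...
SHORTCODE = re.compile(r"^:[a-z0-9_+\-]+:")

# ... or as the emoji character itself. Assumption: these two code point
# ranges cover the common cases; they are not a complete emoji definition.
EMOJI_RANGES = ((0x1F300, 0x1FAFF), (0x2600, 0x27BF))


def starts_with_gitmoji(subject: str) -> bool:
    """True if the subject line begins with an emoji shortcode or emoji character."""
    text = subject.lstrip()
    if not text:
        return False
    if SHORTCODE.match(text):
        return True
    code_point = ord(text[0])
    return any(low <= code_point <= high for low, high in EMOJI_RANGES)


print(starts_with_gitmoji(":sparkles: add dark mode"))   # True
print(starts_with_gitmoji("\u2728 add dark mode"))       # True (U+2728, sparkles)
print(starts_with_gitmoji("fix: crash on empty input"))  # False
```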

An interesting observation is the sporadic adoption and subsequent abandonment of gitmojis in projects, with CPython being a prime example of this trend.
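One way to make such adoption-then-abandonment visible is to compute the share of matching commits per year. The sketch below is an assumption about how this could be done, not the code behind the figures: `share_by_year` is a hypothetical helper, the sample history is invented (it is not CPython data), and the predicate is a simplistic stand-in for a real gitmoji detector. With PyDriller, the dates could come from each commit's `committer_date`.

```python
from collections import defaultdict
from datetime import date
from typing import Callable, Dict, Iterable, Tuple


def share_by_year(commits: Iterable[Tuple[date, str]],
                  is_structured: Callable[[str], bool]) -> Dict[int, float]:
    """Map each year to the percentage of its commits accepted by `is_structured`."""
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for when, message in commits:
        totals[when.year] += 1
        if is_structured(message):
            hits[when.year] += 1
    return {year: 100.0 * hits[year] / totals[year] for year in sorted(totals)}


# Invented history: the convention appears in 2020 and is gone by 2021.
history = [
    (date(2019, 3, 1), "Fix typo in docs"),
    (date(2019, 9, 12), "Add parser tests"),
    (date(2020, 1, 5), ":sparkles: add plugin API"),
    (date(2020, 4, 18), ":bug: fix crash on startup"),
    (date(2020, 7, 2), ":memo: update changelog"),
    (date(2020, 11, 30), "Bump version"),
    (date(2021, 2, 14), "Remove plugin API"),
    (date(2021, 6, 9), "Fix flaky test"),
]

# Simplistic stand-in predicate: the subject starts with a ":shortcode:".
print(share_by_year(history, lambda m: m.startswith(":")))
# {2019: 0.0, 2020: 75.0, 2021: 0.0}
```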

Figure 7: CPython Conventional Commits

Conclusion

Our analysis across various repositories indicates a predominant preference for bespoke commit conventions over standardized ones. These custom conventions often diverge significantly from established norms, underscoring the diversity of commit practices across the software development landscape. In short:

* Companies would rather use their own commit structure than Conventional Commits.
* Even when they adopt them, companies usually do not stick to community-driven conventions such as Conventional Commits or gitmoji, and usually end up replacing them.
* Gitmoji is almost never used, except in the gitmoji repository itself or for a short period before being abandoned.
* Depending on the project, automatically generated commits (merges, pull requests, squashes, etc.) make structuring commits less necessary, because the discussion stays in the pull request handled by the repository maintainers.

2. Top Contributors overview

3. Comparing Apache and Meta

To test our fourth hypothesis, we chose the companies Apache and Meta. We could see immediately that our hypothesis was incorrect: Apache defines a different convention for each of its projects in that project's contributing.md file. We therefore changed our approach and decided to compare the attitude each company takes towards its commits.

Figure 10: React conventional commits

Figure 11: React Native conventional commits

Conclusion

Apache is comparatively much stricter and much more explicit about its conventions. Apache has no “ultimate” convention: it defines a convention for each project and enforces each one strictly within that project. This does not seem to be the case for Meta. Not only are Meta’s conventions unknown, but Meta also claims to squash commits so that users do not have to worry about commit structure, which would imply that Meta’s commits are all automated rather than written by humans. However, our study suggests otherwise: we found many more unconventional or “unprofessional” commits in Meta’s repositories, which makes it clear that they were written by humans. A prominent example is the following:

Figure 12: React non conventional commit

Meta has been very difficult to study, from the contradictory results to the very secretive conventions, in contrast to Apache’s very straightforward approach to commits.

* Companies with multiple projects (Meta and Apache) do not use the same conventions for all their projects. It seems that each team tailors its commits to its own requirements.

4. Limitations

5. Concluding Thoughts

This study aimed to shed light on different commit practices but also highlights the intricate nature of software development workflows, underscoring the need for further, more nuanced research in this area.

VI. Tools

Our tool detects conventions by using a regex for each convention’s message format, simply because most commit conventions require specific terms at the beginning of a commit message. Our tool also uses Natural Language Processing to detect other convention patterns. Indeed, many projects explicitly state that their convention is simply “have a verb at the beginning of the commit”: some require the verb to be in the present tense, some require the past tense, but most only require that it be a verb. Most conventions are handled by modifying the subsystems within the regex, as most projects use a variation of that same format. Although we studied multiple projects, we kept only the most pertinent results in the notebook provided.

In summary, our tool employs a combination of regex and NLP techniques to detect commit message conventions. The regex is adept at capturing structured patterns, while NLP allows for the interpretation of conventions expressed in natural language. The tool’s adaptability is further enhanced through the ability to modify subsystems within the regex, ensuring its applicability across a range of convention variations. The results presented in the notebook have been curated to emphasize the most pertinent insights derived from the broader study.
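The idea of adapting one regex by swapping its list of subsystems can be sketched as follows. This is an illustration under assumptions, not the code from the notebook: `subsystem_pattern` is a hypothetical helper, the subsystem list is an example rather than any project's official list, and the comma-separated form ("src,lib: ...") is assumed to be allowed.

```python
import re
from typing import Sequence


def subsystem_pattern(subsystems: Sequence[str]) -> re.Pattern:
    """Build a regex for subjects shaped like 'subsystem[,subsystem...]: description'."""
    if not subsystems:
        raise ValueError("at least one subsystem name is required")
    names = "|".join(re.escape(name) for name in subsystems)
    # One known name, optionally more separated by commas, then ": " and text.
    return re.compile(rf"^(?:{names})(?:,\s?(?:{names}))*: \S")


# Example list only (an assumption); each project would supply its own names.
node_like = subsystem_pattern(["doc", "src", "lib", "test", "build"])

for subject in ("doc: fix typo in guide",
                "src,lib: refactor module loader",
                "feat: add streaming API",
                "Update dependencies"):
    print(f"{subject!r}: {bool(node_like.match(subject))}")
```

Only the first two subjects match here. Passing a different list, for example commit types instead of subsystem names, yields a detector for another convention without touching the rest of the pattern, which is the kind of adaptability described above.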

VII. References
